The Magician is the first multi-modal large language model with free-form multi-image localization capabilities, achieving precise localization in complex multi-image scenarios and outperforming models with a scale of 70B in performance.
Text-to-Image
Transformers English